112 research outputs found

Hardware for the fast computation of the elementary functions

Author: Wong Weng Fai
Publication date: 01/01/1993

Thesis (Ph.D. in Engineering)--University of Tsukuba, (B), no. 909, 1993.7.3

Tsukuba Repository

Bit-Width Analysis for General Applications

Authors: Ding Yang; Wong Weng Fai
Publication date: 01/01/2005

It is widely known that a significant fraction of bits are useless or even unused during program execution. Bit-width analysis aims to find the minimum number of bits needed for each variable in a program while preserving correct execution and saving resources. In this paper, we propose a static analysis method for bit-widths in general applications, which approximates conservatively at compile time and is independent of runtime conditions. While most related work focuses on integer applications, our method is also tailored and applicable to floating-point variables, and could be extended to transform floating-point numbers into fixed-point numbers together with precision analysis. We use more precise representations for the data value ranges of both scalar and array variables. Element-level analysis is carried out for arrays. We also suggest an alternative to the standard fixed-point iterations in bi-directional range analysis. These techniques are implemented on the Trimaran compiler structure and tested on a set of benchmarks to show the results.

Singapore-MIT Alliance (SMA)

DSpace@MIT
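
As a rough illustration of the idea in the bit-width analysis abstract above, the following Python sketch propagates conservative integer value ranges through two operations and derives the minimum width from each range. It is not the authors' method: it covers only forward propagation over integers, and every name in it (bits_for_range, add_range, mul_range) is hypothetical.

```python
def bits_for_range(lo: int, hi: int) -> int:
    """Minimum number of bits that can hold every integer in [lo, hi].

    Unsigned encoding is assumed when lo >= 0, two's complement otherwise.
    """
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    # Signed: need n with -2**(n-1) <= lo and hi <= 2**(n-1) - 1.
    return max((-lo - 1).bit_length(), hi.bit_length()) + 1


def add_range(a, b):
    """Conservative range of x + y for x in a and y in b."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a, b):
    """Conservative range of x * y for x in a and y in b."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))


# Hypothetical program fragment: x and y are both known to lie in [0, 15].
x = (0, 15)
y = (0, 15)
print(bits_for_range(*add_range(x, y)))  # x + y lies in [0, 30]
print(bits_for_range(*mul_range(x, y)))  # x * y lies in [0, 225]
```

In this sketch the sum needs 5 bits and the product 8 bits, both well below a default 32-bit integer.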

Ameliorating the Overhead of Dynamic Optimization

Authors: Wong Weng Fai; Zhao Qin
Publication date: 01/01/2005

Dynamic optimization has several key advantages. These include the ability to work on binary code in the absence of sources and to perform optimization across module boundaries. However, it has a significant disadvantage vis-à-vis traditional static optimization: a significant runtime overhead. There can be a performance gain only if the overhead can be amortized. In this paper, we quantitatively analyze the runtime overhead introduced by a dynamic optimizer, DynamoRIO. We found that the major overhead does not come from the optimizer's operation. Instead, it comes from the extra code in the code cache added by DynamoRIO. After a detailed analysis, we propose a method of trace construction that ameliorates the overhead introduced by the dynamic optimizer, thereby reducing the runtime overhead of DynamoRIO. We believe that the results of the study as well as the proposed solution are applicable to other scenarios, such as dynamic code translation and managed execution, that utilize a framework similar to that of dynamic optimization.

Singapore-MIT Alliance (SMA)

DSpace@MIT

Data Prefetching via Off-line Learning

Author: Wong Weng Fai
Publication date: 01/01/2003

The widely acknowledged performance gap between processors and memory has been the subject of much research. In the Explicitly Parallel Instruction Computing (EPIC) paradigm, the combination of in-order issue and the presence of a large number of parallel function units has further worsened the problem. Prefetching, by hardware, software or a combination of both, has been one of the primary mechanisms to alleviate this problem. In this talk, we will discuss two prefetching mechanisms, one hardware and the other software, suitable for implementation in EPIC processors. Both methods rely on the off-line learning of Markovian predictors. In the hardware mechanism, the predictors are loaded into a table that is used by a prefetch engine. We have shown that the method is particularly effective for prefetching into the L2 cache. Our software mechanism, which we call predicated prefetch, leverages informing loads. This is used in conjunction with data remapping and off-line learning of Markovian predictors. This distinguishes our approach from early software prefetching techniques that involve only static program analysis. Our experiments show that this framework, together with the algorithms used in it, can effectively remove, in the best instance, 30% of the stall cycles due to cache misses. The results also show that the framework performs better than pure hardware stride predictors and has lower bandwidth and instruction overheads than pure software approaches.

Singapore-MIT Alliance (SMA)

DSpace@MIT
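
To make the notion of an off-line learned Markovian predictor in the abstract above more concrete, here is a minimal Python sketch of a first-order predictor trained on a recorded miss-address trace. The abstract does not specify the predictor's order, table format or training procedure, so all of these are assumptions, and the function names and the example trace are hypothetical.

```python
from collections import Counter, defaultdict


def learn_markov_table(trace):
    """Off-line step: map each miss address to its most frequent successor."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    return {cur: succ.most_common(1)[0][0] for cur, succ in counts.items()}


def count_covered_misses(table, trace):
    """Count the misses that a prefetch of table[cur] would have covered."""
    covered = 0
    for cur, nxt in zip(trace, trace[1:]):
        if table.get(cur) == nxt:
            covered += 1
    return covered


# Hypothetical miss-address trace recorded during a profiling run.
profile_trace = [0x100, 0x140, 0x100, 0x140, 0x100, 0x200]
table = learn_markov_table(profile_trace)
print({hex(k): hex(v) for k, v in table.items()})
print(count_covered_misses(table, profile_trace), "of", len(profile_trace) - 1)
```

In a hardware scheme of the kind the abstract describes, a table like this would be loaded into a prefetch engine; here it is simply a dictionary that is replayed against the same trace.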

Dynamic Memory Optimization using Pool Allocation and Prefetching

Authors: Rabbah Rodric; Wong Weng Fai; Zhao Qin
Publication date: 01/01/2006

Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead a low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched to the caches will contain fewer items from different contexts. If the allocation patterns are closely matched to the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4.

Singapore-MIT Alliance (SMA)

DSpace@MIT
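
The pool allocation strategy described in the abstract above can be pictured with a small Python model that hands out simulated addresses from one contiguous pool per requesting context. This is only a sketch under stated assumptions: the class name, the fixed pool size and the use of a string as the "context" are all hypothetical, and the real system works on binaries inside DynamoRIO rather than on a Python object.

```python
class PoolAllocator:
    """Toy model: one contiguous address range (pool) per allocation context."""

    def __init__(self, pool_size=1024):
        self.pool_size = pool_size
        self.pools = {}      # context -> [base address, bytes used]
        self.next_base = 0   # base address of the next pool to create

    def alloc(self, context, size):
        """Return a simulated address for `size` bytes requested by `context`."""
        if context not in self.pools:
            self.pools[context] = [self.next_base, 0]
            self.next_base += self.pool_size
        base, used = self.pools[context]
        if used + size > self.pool_size:
            raise MemoryError("pool exhausted in this toy model")
        self.pools[context][1] = used + size
        return base + used


# Interleaved requests from two hypothetical allocation sites.
heap = PoolAllocator()
a0 = heap.alloc("list_node", 16)
b0 = heap.alloc("tree_node", 16)
a1 = heap.alloc("list_node", 16)
print(a0, a1, b0)
```

Although the requests are interleaved, the two list nodes end up adjacent (addresses 0 and 16) while the tree node lives in a separate pool (address 1024), which is the spatial locality and access regularity the abstract is after.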

Memory Hierarchy Hardware-Software Co-design in Embedded Systems

Authors: Ge Zhiguo; Lim H. B.; Wong Weng Fai
Publication date: 01/01/2005

The memory hierarchy is the main bottleneck in modern computer systems as the gap between the speed of the processor and the memory continues to grow. The situation in embedded systems is even worse. The memory hierarchy consumes a large amount of chip area and energy, which are precious resources in embedded systems. Moreover, embedded systems have multiple design objectives such as performance, energy consumption, and area. Customizing the memory hierarchy for specific applications is a very important way to take full advantage of limited resources to maximize performance. However, the traditional custom memory hierarchy design methodologies are phase-ordered. They separate the application optimization from the memory hierarchy architecture design, which tends to result in locally optimal solutions. In traditional hardware-software co-design methodologies, much of the work has focused on utilizing reconfigurable logic to partition the computation. However, utilizing reconfigurable logic to perform the memory hierarchy design is seldom addressed. In this paper, we propose a new framework for designing the memory hierarchy for embedded systems. The framework takes advantage of flexible reconfigurable logic to customize the memory hierarchy for specific applications. It combines the application optimization and memory hierarchy design to obtain a globally optimal solution. Using the framework, we performed a case study to design a new software-controlled instruction memory that showed promising potential.

Singapore-MIT Alliance (SMA)

DSpace@MIT

An Interpolative Analytical Cache Model with Application to Performance-Power Design Space Exploration

Authors: Peng Bing; Tay Yong Chiang; Wong Weng Fai
Publication date: 01/01/2005

Caches are known to consume up to half of all system power in embedded processors. Co-optimizing performance and power of the cache subsystems is therefore an important step in the design of embedded systems, especially those employing application-specific instruction processors. In this project, we propose an analytical cache model that succinctly captures the miss performance of an application over the entire cache parameter space. Unlike exhaustive trace-driven simulation, our model requires that the program be simulated only once so that a few key characteristics can be obtained. Using these application-dependent characteristics, the model can span the entire cache parameter space consisting of cache sizes, associativity and cache block sizes. In our unified model, we are able to cater for direct-mapped, set-associative and fully associative instruction, data and unified caches. Validation against full trace-driven simulations shows that our model has a high degree of fidelity. Finally, we show how the model can be coupled with a power model for caches such that one can very quickly decide on Pareto-optimal performance-power design points for rapid design space exploration.

Singapore-MIT Alliance (SMA)

DSpace@MIT

Hierarchical Multi-Bottleneck Classification Method And Its Application to DNA Microarray Expression Data

Authors: Hsu Wen Jing; Wong Weng Fai; Xiong Xuejian
Publication date: 01/01/2003

The recent development of DNA microarray technology is creating a wealth of gene expression data. Typically these datasets have high dimensionality and considerable variety. Analysis of DNA microarray expression data is a fast-growing research area that interfaces various disciplines such as biology, biochemistry, computer science and statistics. It has been concluded that clustering and classification techniques can be successfully employed to group genes based on the similarity of their expression patterns. In this paper, a hierarchical multi-bottleneck classification method is proposed and applied to classify a publicly available gene microarray expression dataset of the budding yeast Saccharomyces cerevisiae.

Singapore-MIT Alliance (SMA)

DSpace@MIT

Bit-Packing Optimization for StreamIt

Authors: Agrawal Kunal; Amarasinghe Saman P.; Wong Weng Fai
Publication date: 01/01/2003

StreamIt is a language specifically designed for modern streaming applications. An important class of these applications operates on streams of bits. This paper presents the motivation for a bit-packing optimization to be implemented in the StreamIt compiler for the RAW Architecture. This technique aims to pack bits into integers so that operations can be performed on multiple bits at once, thus increasing the performance of these applications considerably. This paper gives some simple example applications to illustrate the various conditions where this technique can be applied and also analyses some of its limitations.

Singapore-MIT Alliance (SMA)

DSpace@MIT
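
The core idea of the bit-packing abstract above, packing a stream of bits into an integer so that a single operation acts on many bits at once, can be sketched in Python as follows. This is an illustration only, not the StreamIt compiler transformation: the helper names are hypothetical, and Python integers are unbounded, whereas a real target would pack into fixed-width machine words.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, element 0 in the lowest bit."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def unpack_bits(word, n):
    """Recover the first n bits of a packed integer as a list of 0/1 values."""
    return [(word >> i) & 1 for i in range(n)]


def xor_streams_packed(a_bits, b_bits):
    """XOR two equal-length bit streams with a single integer operation."""
    if len(a_bits) != len(b_bits):
        raise ValueError("streams must have equal length")
    return unpack_bits(pack_bits(a_bits) ^ pack_bits(b_bits), len(a_bits))


a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print(xor_streams_packed(a, b))
```

The unpacked result matches the element-wise XOR of the two streams, but the XOR itself is one integer operation instead of one operation per bit.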